Abstract. We study the three state perfect phylogeny problem and establish
a generalization of the four gamete condition (also called the Splits Equivalence
Theorem) for sequences over three state characters. Our main result is that a
set of input sequences over three state characters allows a perfect phylogeny
if and only if every subset of three characters allows a perfect phylogeny. In
establishing these results, we prove fundamental structural features of the
perfect phylogeny problem on three state characters and completely characterize
the minimal obstruction sets that must occur in three state input sequences
that do not have a perfect phylogeny. We further give a proof for a stated
lower bound involved in the conjectured generalization of our main result to
any number of states.

Until this work, the notion of a conflict, or incompatibility, graph has been
defined for two state characters only. Our generalization of the four gamete
condition allows us to generalize the notion of incompatibility to three state
characters. The resulting incompatibility structure is a hypergraph, which can
be used to solve algorithmic and theoretical problems for three state characters.



1. Introduction 



One of the fundamental problems in biology is the construction of phylogenies, 
or evolutionary trees, to describe ancestral relationships between a set of observed 
taxa. Each taxon is represented by a sequence and the evolutionary tree provides 
an explanation of branching patterns of mutation events transforming one sequence 
into another. There have been many elegant theoretical and algorithmic results on 
the problems of reconstructing a plausible history of mutations that generate a 
given set of observed sequences and determining the minimum number of such
events needed to explain the sequences.

A widely used model in phylogeny construction and population genetics is the
infinite sites model, in which the mutation of any character can occur at most 
once in the phylogeny. This implies that the data must be binary (a character 
can take on at most two states), and that without recombination the phylogeny 
must be a tree, called a (binary) Perfect Phylogeny. The problem of determining
if a set of binary sequences fits the infinite sites model without recombination
corresponds to determining if the data can be derived on a binary Perfect Phylogeny.
A generalization of the infinite sites model is the infinite alleles model, where any
character can mutate many times but each mutation of the character must lead to a
different allele (state). Again, without recombination, the phylogeny is a tree, called
a multi-state Perfect Phylogeny. Correspondingly, the problem of determining if
multi-state data fits the infinite-alleles model without recombination corresponds
to determining if the data can be derived on a multi-state perfect phylogeny.

In the case of binary sequences, the well-known Splits Equivalence Theorem (also 
known as the four gamete condition) gives a necessary and sufficient condition for
the existence of a (binary) perfect phylogeny. 

Theorem 1.1 (Splits Equivalence Theorem, Four Gamete Condition [?]).
A perfect phylogeny exists for binary input sequences if and only if no
pair of characters contains all four possible binary pairs 00, 01, 10, 11.
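
As an illustration of how the condition in Theorem 1.1 can be checked, the following is a minimal sketch, assuming the input is given as a list of equal-length rows with entries 0 or 1. The function name `four_gamete_violation` is hypothetical and not part of the original text; it returns a pair of column indices that exhibits all four gametes, or `None` if no such pair exists.

```python
from itertools import combinations

def four_gamete_violation(rows):
    """Return a pair of column indices (i, j) whose columns contain all four
    binary pairs 00, 01, 10, 11, or None if no such pair exists.

    Assumption: every row has the same length and entries in {0, 1}.
    """
    if not rows:
        return None
    m = len(rows[0])
    for i, j in combinations(range(m), 2):
        # Collect the distinct (state_i, state_j) pairs seen in these columns.
        gametes = {(row[i], row[j]) for row in rows}
        if len(gametes) == 4:
            return (i, j)
    return None
```

Under these assumptions, a returned pair plays the role of the obstruction set described in the text, and a `None` result means the four gamete condition holds for every pair of characters.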

It follows from this theorem that for binary input, it is possible to either construct 
a perfect phylogeny, or output a pair of characters containing all four gametes as 
an obstruction set witnessing the nonexistence of a perfect phylogeny. This test is 
the building block for many theoretical results and practical algorithms. Among 
the many applications of this theorem, Gusfield et al. [?] and Huson et al. [?]
apply the theorem to achieve decomposition theorems for phylogenies; Gusfield,
Hickerson, and Eddhu [21], Bafna and Bansal [?], and Hudson and Kaplan [?]
use it to obtain lower bounds for recombination events; Gusfield et al. [17, 20] use
it to obtain algorithms for constructing networks with constrained recombination;
Sridhar et al. [?, ?, 33] and Satya et al. [?] use it to achieve a faster near-perfect
phylogeny reconstruction algorithm; Gusfield [?] uses it for phase inference
(with subsequent papers by Gusfield et al. [?], Eskin, Halperin, and
Karp [?], Satya and Mukherjee [28], and Bonizzoni [?]); and Sridhar et al. [31]
use it to obtain phylogenies from genotypes.

The focus of this work is to extend results for the binary perfect phylogeny prob- 
lem to the multiple state character case, addressing the following natural questions 
arising from the Splits Equivalence Theorem. Given a set of sequences on r states 
($r \ge 3$), is there a necessary and sufficient condition for the existence of a perfect
phylogeny analogous to the Splits Equivalence Theorem? If no perfect phylogeny 
exists, what is the size of the smallest witnessing obstruction set? 

In 1975, Fitch gave an example of input S over three states such that every pair 
of characters in S allows a perfect phylogeny while the entire set of characters S 
does not [?, 13, ?, ?]. In 1983, Meacham generalized these results to characters
over r states (r > 3) [27], constructing a class of sequences called Fitch-Meacham
examples, which we examine in detail in Section 7. Meacham writes:

"The Fitch examples show that any algorithm to determine whether 
a set of characters is compatible must consider the set as a whole 
and cannot take the shortcut of only checking pairs of characters." 


However, while the Fitch-Meacham construction does show that checking pairs 
of characters is not sufficient for the existence of a perfect phylogeny, our main 
result will show that for three state input, there is a sufficient condition which does 
not need to consider the entire set of characters simultaneously. In particular, we 
give a complete answer to the questions posed above for three state characters, by 

(1) showing the existence of a necessary and sufficient condition analogous to 
the Splits Equivalence Theorem (Sections 3 and 4),

(2) in the case no perfect phylogeny exists, proving the existence of a small
obstruction set as a witness (Section 4),

(3) giving a complete characterization of all minimal obstruction sets (Section
5), and

(4) proving a stated lower bound involved in the conjectured generalization of
our main result to any number of states (Section 7).

In establishing these results, we prove fundamental structural features of the 
perfect phylogeny problem on three state characters. 

2. Perfect Phylogenies and Partition Intersection Graphs 

The input to our problem is a set of n sequences (representing taxa), where each
sequence is a string of length m over r states. Throughout this paper, the states
under consideration will be the set $\{0, 1, 2, \ldots, r-1\}$ (in particular, in the case r = 2,
the input consists of binary sequences over {0,1}). The input can be considered as a matrix
of size $n \times m$, where each row corresponds to a sequence and each column corresponds
to a character (or site). We denote characters by $C = \{\chi^1, \chi^2, \chi^3, \ldots, \chi^m\}$
and the states of character $\chi^i$ by $\chi^i_j$ for $0 \le j \le r-1$. A species is a sequence
$s_1, s_2, \ldots, s_m \in \chi^1_{j_1} \times \chi^2_{j_2} \times \cdots \times \chi^m_{j_m}$, where $s_i$ is the state of character $\chi^i$ for s.

The perfect phylogeny problem is to determine whether an input set S can be
displayed on a tree T such that

(1) each sequence in input set S labels exactly one leaf in T,

(2) each vertex of T is labeled by a species, and

(3) for every character $\chi^i$ and for every state $\chi^i_j$ of character $\chi^i$, the set of all
vertices in T such that the state of character $\chi^i$ is $\chi^i_j$ forms a connected
subtree of T.
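
Condition (3) can be checked directly on a candidate labeled tree. The following is a minimal sketch, assuming the tree is given as an adjacency dictionary `adj` and the vertex labels as a dictionary `label` mapping each vertex to a tuple of states. Both names and the function `states_are_convex` are hypothetical, and conditions (1) and (2) are not checked here.

```python
from collections import deque

def states_are_convex(adj, label):
    """Check condition (3): for every character and state, the vertices
    carrying that state induce a connected subtree.

    adj   : dict mapping each vertex to an iterable of its neighbours
    label : dict mapping each vertex to a tuple of states (one per character)
    """
    nodes = list(label)
    m = len(label[nodes[0]])
    for i in range(m):
        for state in {label[v][i] for v in nodes}:
            members = {v for v in nodes if label[v][i] == state}
            # Breadth-first search restricted to vertices with this state.
            start = next(iter(members))
            seen = {start}
            queue = deque([start])
            while queue:
                u = queue.popleft()
                for v in adj[u]:
                    if v in members and v not in seen:
                        seen.add(v)
                        queue.append(v)
            if seen != members:
                return False
    return True
```

For example, on a path with three vertices labeled 0, 1, 0 in a single character, state 0 is split by the middle vertex and the check fails.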

The general perfect phylogeny problem (with no constraints on r, n, and m) is 
NP-complete [?, ?]. However, the perfect phylogeny problem becomes polynomially 
solvable (in n and m) when r is fixed. For r = 2, this follows from the Splits
Equivalence Theorem 1.1. For larger values of r, this was shown by Dress and
Steel for r = 3 [?], by Kannan and Warnow for r = 3 or 4 [?], and by Agarwala 
and Fernandez-Baca for all fixed r [?] (with an improved algorithm by Kannan and 
Warnow [?]). 

Definition 2.1 ([8, 30]). For a set of input sequences S, the partition intersection
graph G(S) is obtained by associating a vertex for each character state and an edge
between two vertices $\chi^i_j$ and $\chi^k_l$ if there exists a sequence s with state j in character
$\chi^i \in C$ and state l in character $\chi^k \in C$. We say s is a row that witnesses edge
$(\chi^i_j, \chi^k_l)$. For a subset of characters $\Phi = \{\chi^{i_1}, \chi^{i_2}, \ldots, \chi^{i_k}\}$, let $G(\Phi)$ denote the
partition intersection graph G(S) restricted to the characters in $\Phi$.

Note that by definition, there are no edges in the partition intersection graph 
between states of the same character. 
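
The construction in Definition 2.1 can be sketched as follows, assuming rows are tuples of states and a vertex is represented as a pair (character index, state). The function name `partition_intersection_graph` and this representation are assumptions made for illustration only. Each edge is stored together with the index of one row that witnesses it.

```python
from itertools import combinations

def partition_intersection_graph(rows):
    """Return a dict mapping each edge, a frozenset of two vertices
    (character index, state), to the index of one row witnessing it."""
    witness = {}
    for r, row in enumerate(rows):
        # Every pair of distinct characters in a row contributes one edge.
        for i, k in combinations(range(len(row)), 2):
            edge = frozenset({(i, row[i]), (k, row[k])})
            witness.setdefault(edge, r)
    return witness
```

Since each edge joins states of two distinct characters taken from the same row, no edge can connect two states of the same character, in agreement with the note above.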

Definition 2.2. A graph H is chordal, or triangulated, if there are no induced 
chordless cycles of length four or greater in H.
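
One standard way to test the property in Definition 2.2 on small graphs, not taken from the paper, is to repeatedly delete a simplicial vertex (a vertex whose neighbours are pairwise adjacent); a graph is chordal exactly when this process removes every vertex. The sketch below assumes the graph is given as a dictionary of neighbour sets, and the name `is_chordal` is hypothetical.

```python
def is_chordal(adj):
    """Return True if the undirected graph has no induced chordless cycle
    of length four or greater.

    adj : dict mapping each vertex to the set of its neighbours
    """
    g = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    while g:
        simplicial = None
        for v, ns in g.items():
            # v is simplicial if its neighbours are pairwise adjacent.
            if all(u in g[w] for u in ns for w in ns if u != w):
                simplicial = v
                break
        if simplicial is None:
            return False  # no simplicial vertex: a chordless cycle remains
        for w in g[simplicial]:
            g[w].discard(simplicial)
        del g[simplicial]
    return True
```

This quadratic-style elimination is meant only for the small graphs considered here; faster linear-time chordality tests exist but are not needed for illustration.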

Consider coloring the vertices of the partition intersection graph G(S) in the following
way. For each character $\chi^i$, assign a single color to the vertices $\chi^i_0, \chi^i_1, \ldots, \chi^i_{r-1}$.
A proper triangulation of the partition intersection graph G(S) is a chordal supergraph
of G(S) such that every edge has endpoints with different colors. In [8],
Buneman established the following fundamental connection between the perfect
phylogeny problem and triangulations of the corresponding partition intersection
graph.

Theorem 2.3 ([8, ?]). A set of taxa S admits a perfect phylogeny if and only if the
corresponding partition intersection graph G(S) has a proper triangulation.

We will use Theorem 2.3 to extend the Splits Equivalence Theorem to a test for
the existence of a perfect phylogeny on trinary state characters. In a different
direction, Theorem 2.3 and triangulation were also recently used to obtain an algorithm
to handle perfect phylogeny problems with missing data [?].

To outline our approach, suppose a perfect phylogeny exists for S and consider 
every subset of three characters. Then each of these $\binom{m}{3}$ subsets of characters also has a
perfect phylogeny. We show that this necessary condition is also sufficient and 
moreover, we can systematically piece together the proper triangulations for each 
triple of characters to obtain a triangulation for the entire set of characters. On 
the other hand, if no perfect phylogeny exists, then we show there exists a witness 
set of three characters for which no perfect phylogeny exists. This extends the 
Splits Equivalence Theorem to show that for binary and trinary state input, the 
number of characters needed for a witness obstruction set is equal to the number
of character states. The following is the main theorem of the paper. 

Theorem 2.4. Given an input set S on m characters with at most three states per
character ($r \le 3$), S admits a perfect phylogeny if and only if every subset of three
characters of S admits a perfect phylogeny.

By this theorem, in order to verify that a trinary state input matrix S has a
perfect phylogeny, it suffices to verify that partition intersection graphs $G[\chi^i, \chi^j, \chi^k]$
have proper triangulations for all triples $\chi^i, \chi^j, \chi^k$. In Section 7, we will show
that the Fitch-Meacham examples [13, 27] demonstrate that the size of the witness
set in Theorem 2.4 is best possible.
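
The way Theorem 2.4 reduces the problem to triples can be sketched as a simple driver loop. This is only an outline under stated assumptions: the function `find_obstruction_triple` and its parameter `triple_has_perfect_phylogeny` are hypothetical names, and the per-triple test itself is supplied by the caller rather than implemented here.

```python
from itertools import combinations

def find_obstruction_triple(rows, triple_has_perfect_phylogeny):
    """Return the column indices of a triple of characters that fails the
    supplied per-triple test, or None if every triple passes.

    rows : list of equal-length tuples of states
    triple_has_perfect_phylogeny : callable taking the rows restricted to
        three columns and returning True or False (assumed, caller-supplied)
    """
    m = len(rows[0])
    for cols in combinations(range(m), 3):
        # Restrict every row to the three chosen characters.
        sub = [tuple(row[c] for c in cols) for row in rows]
        if not triple_has_perfect_phylogeny(sub):
            return cols
    return None
```

The loop examines $\binom{m}{3}$ triples, so with a polynomial per-triple test the overall check is polynomial in n and m; a returned triple serves as a witness that no perfect phylogeny exists.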



3. Structure of Partition Intersection Graphs for Three Characters

We begin by studying the structure of partition intersection graphs on three 
characters with at most three states per character ($m \le 3$, $r \le 3$). For convenience,
we will denote the three characters by the letters a, b, c (interchangeably referring to
them as characters and colors) and denote the states of these characters by $a_i, b_i, c_i$
($i \in \{0,1,2\}$).

The problem of finding proper triangulations for graphs on at most three colors 
and arbitrary number of states ($m \le 3$, r arbitrary) has been studied in a series of
papers [?, 25, 26]. However, it will be unnecessary in our problem to employ these
triangulation algorithms, as our instances will be restricted to those arising from
character data on at most three states ($m = 3$, $r \le 3$). In such instances, we will
show that if a proper triangulation exists, then the structure of the triangulation is 
very simple. We begin by proving a sequence of lemmas characterizing the possible 
cycles contained in the partition intersection graph. 

Lemma 3.1. Let S be a set of input species on three characters a,b, and c with 
at most three states per character. Suppose every pair of characters induces a 
properly triangulatable partition intersection graph (i.e., G[a,b], G[b,c] and G[a,c]
are properly triangulatable) and let C be a chordless cycle in G[a,b,c]. Then C 
cannot contain all three states of any character.

Figure 1. The row witnessing edge $(b_0, c_0)$ must contain a state
in character a.



Proof. Suppose there is a color, say a, such that all three states $a_0$, $a_1$, and $a_2$
appear in C. Note that C must contain all three colors a, b, and c (since any pair of
colors induces a properly triangulatable graph and any cycle on two colors cannot
be properly triangulated). We have the following cases.

Case I. Suppose there is an edge e in C neither of whose endpoints has color a
(without loss of generality, let $e = (b_0, c_0)$). The row that witnesses this edge must
contain some state in a, say $a_0$. This implies that the vertices $a_0$, $b_0$, and $c_0$ form a
triangle in G[a,b,c], a contradiction since C is assumed to be chordless (see Figure
1).

Case II. Otherwise, every edge has an endpoint of color a, implying each edge
has color pattern either (a, b) or (a, c). Since all three states of a appear, the color
pattern up to relabeling must be as shown in Figure 2(a) (in the figure, color b
appears twice and color c appears once).




Figure 2. The row witnesses for edges $(a_0, c_0)$ and $(c_0, a_2)$ must
share the same state of b.



In this case, the row witness for edge $(a_0, c_0)$ must contain the final state $b_2$ of b
(otherwise there would be an edge between $c_0$ and either $b_0$ or $b_1$, a contradiction
since C is chordless). Similarly, the row witness for edge $(c_0, a_2)$ must also contain state
$b_2$. As shown in Figure 2(b), this gives a cycle $(a_0, b_2), (b_2, a_2), (a_2, b_1), (b_1, a_1),
(a_1, b_0), (b_0, a_0)$ on two colors. Such a cycle is not properly triangulatable, and
therefore G[a,b] is not properly triangulatable, a contradiction.

Since Case I and Case II cannot occur, it follows that $a_0$, $a_1$, and $a_2$ cannot all
appear in C, proving the lemma. □

Before stating the next lemma, we give the following definition.

Definition 3.2. Suppose the endpoints of edge e have colors $\chi^i$ and $\chi^j$. Then any
other edge whose endpoints also have colors $\chi^i$ and $\chi^j$ is called color equivalent to
e. Two edges are called nonadjacent if they do not share a common endpoint.

For example, the edges (01,02) and (co,ai) in Figure [l] are color equivalent and 
nonadjacent. 

Lemma 3.3. Let S be a set of input species on three characters a, b, and c with
at most three states per character. If the partition intersection graph G[a, b, c] is
properly triangulatable, then for every chordless cycle C in G[a, b, c], there exists a
color (a, b, or c) that appears exactly once in C.



Proof. Consider any chordless cycle C of G[a,b,c]. By Lemma 3.1, no color
appears in all three states in C. To obtain a contradiction, suppose each color a, b,
and c appears exactly twice in C and without loss of generality, relabel the states so
that the vertices appearing on the cycle are $a_0, a_1, b_0, b_1, c_0$, and $c_1$. We first show
that C has a pair of nonadjacent edges that are color equivalent. Up to symmetry
and relabeling of colors, there are two cases for the color pattern of C as follows.

Case 1. There is a vertex in the cycle whose neighbors in the cycle have
the same color. Up to relabeling, we can assume this vertex has color a
(say in state $a_0$) and the two adjacent vertices have color b. The states for
the remaining vertices of the cycle are $a_1$, $c_0$, and $c_1$. Now, consider the
vertices adjacent to $b_0$ and $b_1$ other than $a_0$. These vertices must be $c_0$
and $c_1$ (otherwise, the two states of c would be adjacent in the cycle). This
color pattern is shown in Figure 3(a).

Case 2. No vertex in the cycle is adjacent to two vertices of the same color.
Then the two neighbors of a vertex with color a must have colors b and
c. Then the vertex following b in the cycle must have color c (otherwise
vertex b is adjacent to two vertices of the same color). By working this way
around the cycle, the only color pattern possible is as shown in Figure 3(b).

Note that both color patterns contain a pair of nonadjacent and color equivalent
edges (edges e and e' in Figure 3).





Figure 3. Color patterns and nonadjacent color equivalent edges e and e'.



Consider this pair of nonadjacent and color equivalent edges e and e'. Without 
loss of generality, assume that the endpoints of these edges have colors b and c. 
Let s be the row witness for e and s' be the row witness for e'. Since cycle C 
is chordless, the state in character a of row s cannot be $a_0$ or $a_1$. Similarly, the
state in character a of row s' cannot be $a_0$ or $a_1$. Since $a_2$ is the only remaining
state of character a, both s and s' must contain $a_2$. This implies that the partition
intersection graph G[a, b, c] must induce one of the two color patterns in Figure 4.





Figure 4. Induced Color Patterns

In the case illustrated in Figure 4(a), there is a cycle on four vertices induced
by the two characters a and b (see Figure 5(a)), implying G[a,b,c] is not properly
triangulatable. In the case illustrated in Figure 4(b), there are two edge-disjoint
cycles of length four with color pattern a, b, a, c. Since edges in a proper triangulation
cannot connect vertices of the same color, any proper triangulation of G must
contain the two edges f and f' connecting vertices of color b and c (see Figure 5(b)).
However, this induces a cycle of length four on the states of b and c, which does
not have a proper triangulation. This again shows that G[a, b, c] is not properly
triangulatable.





Figure 5. (a) Induced cycle of length four on two colors; (b)
Forced edges f and f'.

Since all of these cases result in contradictions, it follows that there exists a color 
that appears exactly once in C. □ 

Lemmas 3.1 and 3.3 show that if C is a chordless cycle in a properly triangulatable
graph G[a,b,c], then no color can appear in all three states and one color
appears uniquely. This leaves two possibilities for chordless cycles in G[a, b, c] (see
Figure 6):

• a chordless four cycle, with two colors appearing uniquely and 
the remaining color appearing twice 

• a chordless five cycle, with one color appearing uniquely and 
the other two colors each appearing twice 

In the next lemma, we show that if G[a, b, c] is properly triangulatable, the second 
case cannot occur, i.e., G[a,b,c] cannot contain a chordless five cycle.

Figure 6. The only possible chordless cycles in G[a, b, c]: (a) characters
a and b appear uniquely while character c appears twice; (b)
character a appears uniquely while characters b and c each appear
twice.



Lemma 3.4. Let S be a set of input species on three characters a, b, and c with 
at most three states per character. If the partition intersection graph G[a, b, c] is
properly triangulatable, then G[a, b, c] cannot contain chordless cycles of length five
or greater. 



Proof. Lemmas 3.1 and 3.3 together show that G[a, b, c] cannot contain chordless
cycles of length six or greater, so it remains to show that G[a, b, c] cannot contain
chordless cycles of length equal to five.

Suppose C is a chordless cycle in G[a, b, c] of length five; without loss of generality,
let a be the color appearing exactly once in C (say in state $a_0$), let $b_0, b_1$ be the
two states of b in C, and let $c_0, c_1$ be the two states of c in C. Up to relabeling of
the states, the cycle is as shown in Figure 6(b).

Now, any proper triangulation of G[a, b, c] must triangulate cycle C by edges
$(a_0, c_0)$ and $(a_0, b_1)$ shown in Figure 7 (since the only other edge between nonadjacent
vertices of different colors is $(b_0, c_1)$, which would create a non-triangulatable
four cycle on the two colors b and c).






Figure 7. Edges e and e' are both witnessed by state $a_i$.



The row witnesses for edges $(b_0, c_0)$, $(c_0, b_1)$, and $(c_1, b_1)$ must contain a state in
color a that is one of $a_1$ or $a_2$ (otherwise, $a_0$ would have an edge to a non-adjacent
vertex in cycle C, implying C is not chordless). Since there are three edges and
two possible witness states in color a, there are two edges among $(b_0, c_0)$, $(c_0, b_1)$,
$(c_1, b_1)$ that share a witness $a_i$. We denote these two edges by e and e'; as shown
in Figure 7, there are three ways to choose e and e'.

Figure 8 shows that all three cases induce a four cycle on two colors, a contradiction
since G[a,b,c] is properly triangulatable. Therefore, G[a,b,c] cannot contain
a chordless 5-cycle. □




Figure 8. Forced cycles of length four on two colors. 



Lemma 3.5. Let S be a set of input species on three characters a, b, and c with
at most three states per character. If the partition intersection graph G[a, b, c] is
properly triangulatable, then every chordless cycle in G[a, b, c] is uniquely
triangulatable.



Proof. By Lemma 3.4, if C is a chordless cycle in G[a, b, c], then C must be a four
cycle with the color pattern shown in Figure 9 (up to relabeling of the colors). Then
C is uniquely triangulatable by adding the edge between the two colors appearing
uniquely (in Figure 9, these are colors a and b).




Figure 9. Color pattern for chordless cycle C.

□ 

For any three colors a, b, c, Lemma 3.5 gives a simple algorithm to properly
triangulate G[a, b, c]: for each chordless cycle C in G[a, b, c], check that C is a four 
cycle with two nonadjacent vertices having colors that appear exactly once in C 
and add an edge between these two vertices. 
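
The procedure just described can be sketched for a single triple of characters as follows. This is a brute-force illustration under stated assumptions, not the authors' implementation: rows are tuples of three integer states, a vertex is a pair (character index, state), and the helper names `build_graph`, `chordless_cycles`, and `forced_edges` are hypothetical. Chordless cycles are found by testing every vertex subset, which is feasible only because a triple has at most nine vertices. The sketch reports failure when a chordless cycle lacks the four-cycle pattern and does not re-verify chordality of the resulting graph.

```python
from itertools import combinations

def build_graph(rows):
    """Partition intersection graph on vertices (character index, state)."""
    adj = {}
    for row in rows:
        verts = list(enumerate(row))
        for v in verts:
            adj.setdefault(v, set())
        for u, v in combinations(verts, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def chordless_cycles(adj):
    """Yield vertex sets inducing a chordless cycle of length >= 4."""
    verts = sorted(adj)
    for k in range(4, len(verts) + 1):
        for subset in combinations(verts, k):
            s = set(subset)
            # In an induced cycle every vertex has exactly two neighbours.
            if any(len(adj[v] & s) != 2 for v in s):
                continue
            # A 2-regular induced subgraph is one cycle iff it is connected.
            seen, stack = {subset[0]}, [subset[0]]
            while stack:
                for w in adj[stack.pop()] & s:
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
            if seen == s:
                yield s

def forced_edges(rows):
    """Return the set of edges added for the chordless cycles of the graph,
    or None if some chordless cycle is not a four cycle in which exactly
    two colors appear once."""
    adj = build_graph(rows)
    added = set()
    for cycle in chordless_cycles(adj):
        colors = [v[0] for v in cycle]
        unique = [v for v in cycle if colors.count(v[0]) == 1]
        if len(cycle) != 4 or len(unique) != 2:
            return None
        # The two uniquely colored vertices are nonadjacent on the cycle.
        added.add(frozenset(unique))
    return added
```

For instance, the rows (0,1,0), (1,0,0), (2,0,1), (0,2,1) give a single chordless four cycle through the states $a_0, c_0, b_0, c_1$, and the sketch adds the edge between $a_0$ and $b_0$.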

4. The 3-character test 

4.1. Triangulating Triples of Characters. We now consider the case of trinary 
input sequences S on m characters (for m greater than or equal to 4). Our goal is to
prove that the existence of proper triangulations for all subsets of three characters 
at a time is a sufficient condition to guarantee existence of a proper triangulation 
for all m characters. 



By Lemma 3.5, if a set of three characters $\chi^i, \chi^j, \chi^k$ is properly triangulatable,
then there is a unique set of edges $F(\chi^i, \chi^j, \chi^k)$ that must be added to triangulate
the chordless cycles in $G[\chi^i, \chi^j, \chi^k]$. Construct a new graph G'(S) on the same
vertices as G(S) with edge set $E(G(S)) \cup \{\bigcup_{1 \le i < j < k \le m} F(\chi^i, \chi^j, \chi^k)\}$. G'(S) is the
partition intersection graph G(S) together with all of the additional edges used to
properly triangulate chordless cycles in $G[\chi^i, \chi^j, \chi^k]$ ($1 \le i < j < k \le m$). In
G'(S), edges from the partition intersection graph G(S) are called E-edges and
edges that have been added as triangulation edges for some triple of columns are
called F-edges. We call a cycle consisting only of E-edges an E-cycle.

Example 4.1. Consider input set S and the corresponding partition intersection
graph G(S) in Figure 10. Each triple of characters in S induces a chordal graph
while the entire partition intersection graph G(S) contains a chordless cycle of
length four. Since each triple of characters induces a chordal graph, no F-edges are
added and G(S) = G'(S).




Figure 10. Partition intersection graph G'(S) contains a chordless
four cycle.



As Example 4.1 illustrates, the addition of F-edges alone may not be sufficient
to triangulate the entire partition intersection graph. We now turn to the problem
of triangulating the remaining chordless cycles in G'(S).

Consider any E-cycle C that is chordless in G'(S) satisfying the properties

(1) C has length equal to four

(2) all colors of C are distinct

For every such chordless cycle, add the chords between the two pairs of nonadjacent
vertices in C (note that these are legal edges). Call this set of edges F'-edges
and let G''(S) denote the graph G'(S) with the addition of F'-edges. Note that
the sets of E-edges, F-edges, and F'-edges are pairwise disjoint; we call the set of
F and F'-edges non-E edges.

We begin by investigating structural properties of cycles in G'(S) and G''(S)
containing at least one F-edge or F'-edge. Let C be a cycle in G'(S) or G''(S)
containing an edge f that is an F-edge or F'-edge (without loss of generality, let
$f = (a_0, b_0)$). This edge must be added due to an E-cycle D containing $a_0$, $b_0$ and
two other vertices w and z as shown in Figure 11(a) (note that w and z cannot have
color a or b). If f is an F-edge, then w and z have the same color and therefore
cannot be adjacent in G'(S). If f is an F'-edge, then since D is a chordless E-cycle
in G'(S), w and z are again nonadjacent in G'(S). The cycle C created by edge f
is shown in Figure 11(b).

Figure 11. (a) Chordless cycle D; (b) edge $f = (a_0, b_0)$ creates
cycle C (shown in bold).

Since D is an E-cycle, each edge in D has a row witness. Consider first the row
witnesses for edges $(a_0, w)$ and $(a_0, z)$. These row witnesses must contain a state
of b other than $b_0$ (since $a_0$ and $b_0$ are not connected by an E-edge). If both row
witnesses share the same state $b_1$ of b, then the cycle $(b_1, w), (w, b_0), (b_0, z), (z, b_1)$
is a chordless E-cycle on at most three colors in G'(S) as shown in Figure 12 (as
argued above, w and z are nonadjacent in G'(S)). However, all chordless E-cycles
on at most three colors have been triangulated in G'(S), a contradiction.




Figure 12. If the row witnesses for $(a_0, w)$ and $(a_0, z)$ share a
state of b, there is a chordless E-cycle of length four on at most
three colors.

Therefore, the row witnesses for $(a_0, w)$ and $(a_0, z)$ cannot share the same state
of b. Similarly, the row witnesses for $(b_0, w)$ and $(b_0, z)$ cannot share the same state
of a. This implies the following situation, up to relabeling of the states, illustrated
in Figure 13.




Figure 13. Pattern of forced witnesses for edges in D. 



In particular, the following two conditions must be satisfied.

(1) $a_0$ is adjacent to both $b_1$ and $b_2$

(2) $b_0$ is adjacent to both $a_1$ and $a_2$



We use this structure to prove a sequence of lemmas eliminating the possibilities
for chordless cycles in graph G'(S). This sequence of lemmas will show G''(S)
cannot contain a chordless cycle with exactly one non-E edge (Lemmas 4.2 and
4.3), a chordless cycle with two or more non-E edges (Lemma 4.4), or a chordless
E-cycle (Corollary 4.8).



Lemma 4.2. G'(S) cannot contain a chordless cycle with exactly one F-edge.

Proof. Suppose that C is a chordless cycle in G'(S) with exactly one F-edge,
say $f = (a_0, b_0)$. Edge $(a_0, b_0)$ must have been added due to a chordless E-cycle
D on three colors as shown in Figure 13, where w and z are states of the same
color. Note that edge $(a_0, b_0)$ is a forced F-edge that creates cycle C (see Figure
14). If C contains only the two colors a and b, the partition intersection graph on
the three colors a, b, and the shared color of w and z is not properly triangulatable,
a contradiction.




Figure 14. Chordless cycle C on two colors with exactly one F-edge.

This implies any cycle C in G'(S) with exactly one F-edge must contain three
or more colors. As shown in Figure 15, if any of the edges $(b_1, a_1)$, $(b_1, a_2)$, $(b_2, a_1)$,
and $(b_2, a_2)$ are present, there would be a chordless cycle on two colors with exactly
one F-edge, which we have argued cannot occur. It follows that $a_1$ and $a_2$ are
nonadjacent to $b_1$ and $b_2$ by E-edges.

Since $a_1$ is nonadjacent to $b_1$ or $b_2$ by E-edges, any row that contains $a_1$ must
contain state $b_0$ in character b. We call this condition (A1). By a similar argument,
the following conditions must be satisfied:

(A2) any row that contains $a_2$ must contain state $b_0$ in character b.
(B1) any row that contains $b_1$ must contain state $a_0$ in character a.
(B2) any row that contains $b_2$ must contain state $a_0$ in character a.

Figure 15. If any of $(b_1, a_1)$, $(b_1, a_2)$, $(b_2, a_1)$, or $(b_2, a_2)$ are E-edges,
there is a chordless four cycle in G'(S) on two colors with
exactly one F-edge.

Now, let x be a vertex in $C \setminus \{a_0, b_0\}$ and consider the state of character a in any
row that witnesses x (see Figure 16(a)). If this state is $a_0$, then x is adjacent to $a_0$
by an E-edge. Otherwise, if this state is either $a_1$ or $a_2$, then this row witness for x
must contain state $b_0$ by (A1) and (A2). Since C is a chordless cycle, at most one
vertex on $C \setminus \{a_0, b_0\}$ can be adjacent to each of $a_0$ and $b_0$. This shows there can be
at most two such vertices $x_1$ and $x_2$ in $C \setminus \{a_0, b_0\}$, one of which is adjacent to $a_0$
and the other which is adjacent to $b_0$ (moreover, these are adjacencies by E-edges).
Therefore, C has length equal to four formed by edges $(a_0, x_1)$, $(x_1, x_2)$, $(x_2, b_0)$,
and $(b_0, a_0)$ (see Figure 16(b)).




(a) (b) 



Figure 16. Vertices on cycle C. 



Edge $(x_1, x_2)$ is an E-edge since $f = (a_0, b_0)$ is the unique F-edge in C by
assumption. Furthermore, at least one of $x_1$ or $x_2$ has color different from a and b
since C has three or more colors. Without loss of generality, assume vertex $x_1$ has
color different from a and b. The color of $x_2$ is different from b since $x_2$ and $b_0$ are
adjacent, implying edge $(x_1, x_2)$ must have a witness in character b. If this witness
is $b_0$, then $x_1$ and $b_0$ are adjacent by an E-edge, a contradiction to the chordlessness
of cycle C. If this witness is either $b_1$ or $b_2$, then (B1) or (B2) imply that $x_2$ and $a_0$
are adjacent by an E-edge, again a contradiction to the chordlessness of cycle C.



This concludes the proof of Lemma 4.2. □



A similar proof shows that the lemma can be extended to the graph G''(S).

Lemma 4.3. G''(S) cannot contain a chordless cycle with exactly one non-E edge.

Proof. Suppose that C is a chordless cycle in G''(S) with exactly one non-E edge,
say $f = (a_0, b_0)$. If f is an F-edge, then C would be a chordless cycle in G'(S)
with exactly one F-edge, contradicting Lemma 4.2. Therefore f is an F'-edge that
is added due to chordless cycle D as shown in Figure 13 (with w and z different
colors).

Case I. C contains only the two colors a and b. Since the graph in Figure 13
contains all the states of characters a and b, C must also contain one of the edges
$(b_1, a_1)$, $(b_1, a_2)$, $(b_2, a_1)$, or $(b_2, a_2)$ as an E-edge and we have the following cases.

Case I(i). C contains edge $(b_2, a_2)$. This results in an E-cycle of length five
on at most three colors as shown in Figure 17(a). Such a cycle cannot be
chordless in G[a, b, w] by Lemma 3.4. Therefore, vertex w must be adjacent
to one of $b_2$ or $a_2$ by an E-edge. This creates a chordless E-cycle in G'(S)
of length four on three colors; either cycle $(b_2, w), (w, b_0), (b_0, z), (z, b_2)$
shown in Figure 17(b) or cycle $(a_0, w), (w, a_2), (a_2, z), (z, a_0)$ shown in
Figure 17(c) (note that w and z are nonadjacent in G'(S) since cycle D is
chordless). This is a contradiction since all cycles on at most three colors
are triangulated in G'(S).






Figure 17. (a) Edge (b_2, a_2) gives a five-cycle on at most three
colors. (b), (c) Chordless cycle of length four containing three
colors.



Case I(ii). C contains edge (b_1, a_1). This case is symmetric to Case I(i).

Case I(iii). C contains edge (b_2, a_1). This results in an E-cycle of length
four on at most three colors as shown in Figure 18(a). Such a cycle is
triangulated in G'(S), implying there is either an E-edge or an F-edge
between b_2 and w. Then the cycle (b_2, w), (w, b_0), (b_0, z), (z, b_2) is either
a chordless E-cycle in G'(S) (a contradiction since all E-cycles on at most
three colors are triangulated in G'(S)) or a chordless cycle in G'(S) with
exactly one F-edge (contradicting Lemma 4.2).




Figure 18. (a) E-cycle of length four on three colors. (b) Chordless
cycle on three colors with exactly one F-edge.

Case I(iv). C contains edge (b_1, a_2). This case is symmetric to Case I(iii).

It follows that none of the vertex pairs (b_1, a_1), (b_1, a_2), (b_2, a_1), and (b_2, a_2) are
adjacent by an E-edge and any cycle C in G''(S) with exactly one non-E edge must
contain three or more colors. Because of these nonadjacencies, the statements (A1),
(A2), (B1), (B2) from Lemma 4.2 hold:

(A1) any row that contains a_1 must contain state b_0 in character b.
(A2) any row that contains a_2 must contain state b_0 in character b.
(B1) any row that contains b_1 must contain state a_0 in character a.
(B2) any row that contains b_2 must contain state a_0 in character a.

As in the proof of Lemma 4.2, it follows that cycle C has length equal to four
formed by edges (a_0, x_1), (x_1, x_2), (x_2, b_0), and (b_0, a_0) (see Figure 19(b)).





Figure 19. Vertices on cycle C.



Now, edge (x_1, x_2) is an E-edge since f = (a_0, b_0) is the unique non-E edge in
C by assumption. The remainder of the proof follows exactly as in the proof of
Lemma 4.2. □

We now consider chordless cycles in G''(S) with two or more non-E edges.

Lemma 4.4. G''(S) cannot contain a chordless cycle with two or more non-E
edges. 

Proof. Suppose otherwise and let C be a chordless cycle in G''(S) with two or
more non-E edges. Let f be one of the F-edges or F'-edges in C and, without loss of
generality, let f = (a_0, b_0). This edge must have been added due to an E-cycle
D that is chordless in G(S) on a_0, b_0 and two other vertices w and z (see Figure
17(a)). If f is an F-edge, then w and z have the same color and therefore are not
adjacent in G'(S). If f is an F'-edge, then w and z have different colors and are
nonadjacent in G'(S) (since they are nonadjacent vertices in chordless cycle D).

As argued previously, the situation up to relabeling of the states is illustrated
in Figure 13. Furthermore, the proofs of Lemmas 4.2 and 4.3 establish conditions
(A1), (A2), (B1), and (B2), implying C has length equal to four formed by edges
(a_0, x_1), (x_1, x_2), (x_2, b_0), and (b_0, a_0) (see Figure 19(b)). Then since C has two or
more non-E edges, the edge (x_1, x_2) in C is a non-E edge.

We have the following cases for the vertices of C.

Case I. One of x_1 and x_2 has color a and the other has color b. We can assume
without loss of generality that x_1 = b_2 and x_2 = a_2 as illustrated in Figure 20(a).
Since edge (x_1, x_2) is either an F-edge or an F'-edge, it was added because of a
chordless E-cycle D' containing a_2, b_2 and two other vertices y_0 and y_1 (see Figure
20(a)). By (A2), both y_0 and y_1 are adjacent to a_0, giving an E-cycle of length four
(a_0, y_0), (y_0, a_2), (a_2, y_1), (y_1, a_0) on at most three colors (see Figure 20(c)). This
E-cycle must be triangulated in G'(S). However, this cannot be the case since D'
is a chordless cycle in G'(S) and y_0 and y_1 are nonadjacent vertices in D'.




Figure 20. Case I.



Case II. The color of x_1 is different from a and b and the color of x_2 is a or b.
Without loss of generality, assume x_2 = a_2 as illustrated in Figure 21(a). Since
edge (x_1, x_2) (= (x_1, a_2)) is an F-edge or F'-edge, it was added due to a chordless
four-cycle D' on x_1, x_2 (= a_2) and two other vertices y_0 and y_1. The row witnesses for
edges (x_1, y_0) and (x_1, y_1) must contain state a_0 (otherwise, x_1 would be adjacent
to b_0 by (A1) or (A2)). Then we have the E-cycle of length four (a_0, y_0), (y_0, a_2),
(a_2, y_1), (y_1, a_0) on at most three colors. This E-cycle must be triangulated in
G'(S). However, this cannot happen since y_0 and y_1 are nonadjacent vertices in
cycle D'.




Figure 21. Case II.



Case II'. The color of x_1 is a or b and the color of x_2 is different from a and b.
This case is symmetric to that in Case II and is shown in Figure 22.



Figure 22. Case II'.

Case III. Both x_1 and x_2 have colors different from a and b. Since edge (x_1, x_2)
is an F-edge or F'-edge, it was added due to a chordless four-cycle D' on x_1, x_2 and
two other vertices y_0 and y_1. The row witnesses for edges (x_1, y_0) and (x_1, y_1) must
contain state a_0 (otherwise, x_1 would be adjacent to b_0 by (A1) or (A2)). Then
(a_0, y_0), (y_0, x_2), (x_2, y_1), (y_1, a_0) is an E-cycle of length four (see Figure 23(c)).

Note that this cycle is chordless in G'(S), since a_0 and x_2 are nonadjacent vertices
in chordless cycle C and y_0 and y_1 are nonadjacent vertices in chordless cycle D'.
If y_0 and y_1 have the same color, then C has only three colors and this would force
edge (a_0, x_2) to be an F-edge, a contradiction to the assumption that C is chordless
in G''(S). Therefore, the colors of a_0, x_2, y_0, y_1 are all distinct. This cycle would
force edges (a_0, x_2) and (y_0, y_1) to be added as F'-edges, a contradiction since cycle
C is chordless in G''(S) and a_0 and x_2 are nonadjacent vertices in C.




Figure 23. Case III.



This proves the lemma. □

Lemmas 4.2, 4.3, and 4.4 eliminate the possibility of chordless cycles in G''(S)
containing non-E edges. To show that G''(S) is properly triangulated, we proceed
to show that G''(S) does not contain chordless E-cycles. Suppose C is an E-cycle
of length five or greater that is chordless in G'(S) and suppose there is a character
a that appears exactly once (say in state a_0) in C. Label the edges of the path
C \ a_0 in order of appearance by e_1, e_2, e_3, ..., e_{k-1} with e_i = (v_i, v_{i+1}). Since C is
chordless and all edges in C are E-edges, each edge e_i (i = 1, 2, ..., k - 1) must
be witnessed by a row s_i which contains either state a_1 or a_2 in color a. Without
loss of generality, assume e_1 is witnessed by a_1 and let j be the largest index such
that e_j is witnessed by a_1. If j is equal to k - 1, then this creates a four-cycle
(v_1, a_0), (a_0, v_k), (v_k, a_1), (a_1, v_1) on E-edges (see Figure 24(b)). Since v_1 and v_k
are nonadjacent (by the chordlessness of C in G'(S)), this creates an E-cycle on at
most three colors that is chordless in G'(S), which cannot occur.




Figure 24. Chordless cycle C.



Therefore, j must be strictly less than k - 1 and all of the remaining edges
e_{j+1}, ..., e_{k-1} are witnessed by state a_2. Define the a-complete cycle induced by
cycle C and state a_0 as follows (see Figure 25):

\[
I(C, a_0) =
\begin{cases}
(a_0, v_1), (v_1, v_2), (v_2, a_2), (a_2, v_k), (v_k, a_0) & \text{if } j = 1, \\
(a_0, v_1), (v_1, a_1), (a_1, v_{j+1}), (v_{j+1}, a_2), (a_2, v_k), (v_k, a_0) & \text{if } 1 < j < k - 2, \\
(a_0, v_1), (v_1, a_1), (a_1, v_{k-1}), (v_{k-1}, v_k), (v_k, a_0) & \text{if } j = k - 2.
\end{cases}
\]

Figure 25. The a-complete cycle I(C, a_0) induced by cycle C and state
a_0: (a) j = 1, (b) 1 < j < k - 2, (c) j = k - 2.



Observation 4.5. For an E-cycle C such that

(i) C is chordless in G'(S)

(ii) C has length five or greater

(iii) C contains a character a appearing exactly once in state a_0,

the a-complete cycle I(C, a_0) exists. Note that I(C, a_0) contains at least two vertices
of color a and has length five or greater.
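The case analysis defining I(C, a_0) can be sketched in code. The following is a minimal illustration, not part of the original argument: the function name `a_complete_cycle`, the string labels for vertices, and the list `witness_states` (the state of character a in the row witnessing each edge e_i) are all assumptions made for the example.

```python
def a_complete_cycle(vertices, witness_states, a0="a0", a1="a1", a2="a2"):
    """Return the edge list of I(C, a0) (hypothetical helper).

    vertices       -- [v_1, ..., v_k], the path C \\ a0 in order of appearance
    witness_states -- k-1 entries; entry i-1 is the state of a (a1 or a2)
                      in the row witnessing e_i = (v_i, v_{i+1})
    """
    k = len(vertices)
    if k < 4 or len(witness_states) != k - 1:
        raise ValueError("C must have length at least five")
    if any(s not in (a1, a2) for s in witness_states):
        raise ValueError("every edge of the path must be witnessed by a1 or a2")
    if witness_states[0] != a1:
        raise ValueError("relabel states so that e_1 is witnessed by a1")

    v = [None] + list(vertices)          # 1-based indexing, as in the text
    w = [None] + list(witness_states)
    j = max(i for i in range(1, k) if w[i] == a1)
    if j == k - 1:
        # The text shows this case yields a forbidden chordless four-cycle.
        raise ValueError("j = k - 1: I(C, a0) is not defined")

    if j == 1:
        nodes = [a0, v[1], v[2], a2, v[k]]
    elif j == k - 2:
        nodes = [a0, v[1], a1, v[k - 1], v[k]]
    else:
        nodes = [a0, v[1], a1, v[j + 1], a2, v[k]]
    # Close the cycle by pairing each node with its successor.
    return list(zip(nodes, nodes[1:] + nodes[:1]))
```

In all three cases the returned cycle has length five or six and contains at least two states of character a, matching the note in Observation 4.5.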

We use this construction to prove the following lemma.

Lemma 4.6. Suppose C is an E-cycle of length five or greater that is chordless
in G'(S) and suppose there is a character a appearing uniquely in C in state a_0.
Then the two vertices adjacent to a_0 in C have different colors and I(C, a_0) is an
E-cycle that is chordless in G'(S).

Proof. Note that I(C, a_0) exists by Observation 4.5 and all edges in I(C, a_0) are
E-edges. We show I(C, a_0) is chordless in G'(S). The vertex pairs (a_1, v_k) and
(a_2, v_1) are not adjacent in G'(S); otherwise we would obtain a four-cycle on at
most three colors with at most one F-edge that is chordless in G'(S) (see Figure 26).
This is a contradiction, since Lemma 4.2 implies G'(S) cannot contain a chordless
cycle with at most one F-edge. The remaining vertex pairs in I(C, a_0) are in C
and are nonadjacent in G'(S) since C is chordless in G'(S). It follows that I(C, a_0)
is chordless in G'(S).





Figure 26. If either (a_1, v_k) or (a_2, v_1) are adjacent, there is a
cycle on at most three colors with at most one F-edge.

Now suppose for a contradiction that the vertices adjacent to a_0 (vertices v_1 and
v_k in Figure 25) have the same color. Then I(C, a_0) is a cycle on at most three colors
(color a, the color of v_{j+1}, and the shared color of vertices v_1 and v_k). This is an
E-cycle that has length five or greater and is chordless in the partition intersection
graph on these three colors. This is forbidden by Lemma 3.4. Therefore, the two
vertices adjacent to a_0 are states in two different colors.

This proves the lemma. □

We now use this construction to prove properties of chordless E-cycles in G'(S).

Lemma 4.7. If C is an E-cycle that is chordless in G'(S), then C has length
exactly four with four distinct colors.

Proof. Suppose C is a chordless E-cycle in G'(S). Note that C must contain four
or more colors since any chordless E-cycle on at most three colors is triangulated
in G'(S). We first show every color in C appears uniquely. Suppose otherwise and
let a be the color that appears the most often in C with f_a the number of times a
appears. We consider the following cases.

Case I. f_a = 3, i.e., all three states a_0, a_1, and a_2 appear in C.

If there is an edge e = (u, v) in C that does not have any of a_0, a_1, or a_2 as
endpoints, then consider the row r that witnesses edge e; row r must contain some
state of a, say a_i. This implies edges (u, a_i) and (v, a_i) are present in G'(S) and C
is not chordless, a contradiction. Therefore, in this case, every edge e in C must
have exactly one endpoint of color a.

Since C contains four or more colors and every edge is adjacent to a state of a,
by possibly renaming the character states, the color pattern must be as shown in
Figure 27 (with distinct colors b, c, and d). Now, since C has length at least five
and color b appears uniquely in cycle C, the b-complete cycle I(C, b_0) induced by
C and b_0 exists. However, the vertices adjacent to b_0 in C are the same color (both
having color a), which is forbidden by Lemma 4.6. It follows that f_a < 3.





Figure 27. Color pattern in Case I. 

Case II. f_a = 2: let a_0 and a_1 be the two states of a appearing in C. Since C
contains four or more colors, it must be the case that one of the paths from a_0 to
a_1 has three or more edges. We have the following cases.

Case (IIa) both paths from a_0 to a_1 have at least three edges

Case (IIb) one path from a_0 to a_1 has two edges and the other path has
three or more edges

Any edge that does not have color a as one of its endpoints must be witnessed
by a row that contains the third state a_2, as illustrated in Figure 28.





Figure 28. Cases (IIa) and (IIb) in the proof of Lemma 4.7. The
rows witnessing the edges shown in bold must contain state a_2 in
character a.

In case (IIa), the second edge in both paths from a_0 to a_1 are witnessed by state
a_2 and we obtain an E-cycle that is chordless in G'(S) on at most three colors
(shown in bold in Figure 29(a)). This is a contradiction since all E-cycles on at
most three colors must be triangulated in G'(S). In case (IIb), the second and
second to last edge on the a_0 to a_1 path with three or more edges are witnessed
by state a_2. Let D denote the E-cycle of edges (b_0, a_0), (a_0, u), (u, a_2), (a_2, v),
(v, a_1), (a_1, b_0) (shown in Figure 29(b)). Then a_2 and b_0 are not adjacent in G'(S)
(otherwise, we would obtain a cycle of length four on at most three colors with at
most one F-edge). This implies D is an E-cycle that is chordless in G(S); in this
cycle, b_0 has two adjacent vertices of color a and therefore cannot be the only state
of b appearing in D, by Lemma 4.6. This implies one of u or v must also have color
b and therefore D is a chordless cycle on at most three colors containing all three
states of character a, contradicting Lemma 3.1.

Figure 29. Case II.

Case III. f_a = 1, i.e., every color in C appears uniquely. Suppose for a contradiction
that C has length five or greater and let a_0 be a state appearing in C. Then
I(C, a_0) exists in G'(S) by Observation 4.5. However, this gives a chordless cycle
in G'(S) with color a appearing two or more times, which cannot happen by Cases
I and II.

It follows that C is a cycle of length four with all colors appearing uniquely in
C, proving the lemma. □



Lemma 4.7 implies all chordless E-cycles in G'(S) have length four containing
four distinct colors. We have triangulated all such cycles by F'-edges in G''(S),
implying the following corollary.

Corollary 4.8. G''(S) cannot contain a chordless E-cycle.

Lemmas 4.2, 4.3, 4.4, and Corollary 4.8 together imply that G''(S) is properly
triangulated, proving the main theorem.

Theorem 2.4. Given an input set S on m characters with at most three states per
character (r ≤ 3), S admits a perfect phylogeny if and only if every subset of three
characters of S admits a perfect phylogeny.

5. Enumerating Obstruction Sets for Three State Characters 

We now turn to the problem of enumerating all minimal obstruction sets to
perfect phylogenies on three-state character input. By Theorem 2.4, it follows that
the minimal obstruction sets contain at most three characters. We enumerate all
instances S on three characters a, b, and c satisfying the following conditions:

(i) each character a, b, and c has at most three states

(ii) every pair of characters allows a perfect phylogeny

(iii) the three characters a, b, and c together do not allow a perfect
phylogeny.
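Condition (ii) can be checked mechanically. The sketch below assumes, as the surrounding text indicates, that a pair of characters allows a perfect phylogeny exactly when the partition intersection graph restricted to that pair contains no cycle. The input representation (a list of equal-length tuples of states) and the function names are illustrative choices, not taken from the paper.

```python
from itertools import combinations

def pair_graph_edges(rows, a, b):
    """Edges of the partition intersection graph on characters a and b.

    A vertex is a (character, state) pair; two vertices are adjacent
    when some row contains both states.
    """
    return {((a, row[a]), (b, row[b])) for row in rows}

def pair_is_acyclic(rows, a, b):
    """Union-find cycle detection on the two-color graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in pair_graph_edges(rows, a, b):
        ru, rv = find(u), find(v)
        if ru == rv:            # u and v already connected: this edge closes a cycle
            return False
        parent[ru] = rv
    return True

def satisfies_condition_ii(rows):
    """True when every pair of characters has an acyclic two-color graph."""
    m = len(rows[0])
    return all(pair_is_acyclic(rows, a, b) for a, b in combinations(range(m), 2))
```

For example, the four binary rows 00, 01, 10, 11 fail the check because their two-color graph is a four-cycle.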

Note that Condition (ii) implies the partition intersection graph G(S) does not
contain a cycle on exactly two colors and Condition (iii) implies G(S) contains at
least one chordless cycle. Let C be the largest chordless cycle in G(S), i.e.,

\[
C = \arg\max_{\text{chordless cycles } D \text{ in } G(S)} |D|
\]

Condition (ii) and Lemma 3.1 together imply C cannot contain all three states
of any character. Therefore, C has length at most six. If G(S) contains a chordless
six-cycle C, then each color appears exactly twice in C and C must have one of the
color patterns (up to relabeling) shown in Figure 30.

Figure 30. Color patterns for chordless cycle of length six.



In Figures 30(a) and 30(b), there is one state in each character that does not
appear in C (states a_2, b_2, and c_2). Since C is chordless, the witness for each edge is
forced to contain the missing state in the third character. This implies Figure 30(a)
must be completed by the edges in Figure 31(a) and Figure 30(b) must be completed
by the edges in Figure 32(a). In both cases, there is a cycle on two characters a and
b (see Figures 31(b) and 32(b)). This implies the pair of characters a and b is not
properly triangulatable, a contradiction to condition (ii). Therefore, G(S) cannot
contain chordless cycles of length six.





Figure 31. Forced patterns for row witnesses of Figure 30(a).

If C is a chordless cycle in G(S) of length five, then G(S) is not properly
triangulatable by Lemma 3.4, implying Condition (iii) is satisfied. In this case, there must
be two characters (say b and c) appearing in two different states and one character
appearing once in C, as shown in Figure 33 (up to relabeling of the states). Cycle
C contains three edges that are not adjacent to character a (edges (b_0, c_0), (c_0, b_1),
(b_1, c_1) in Figure 33). The row witnesses for these edges must contain either state
a_1 or a_2 in character a.




Figure 33. Color pattern for cycle C of length five. 



Case I. The row witnesses for two adjacent edges share the same state of a and
the row witness for the third edge contains the final state in a. Without loss of
generality, assume (c_0, b_1) and (b_1, c_1) are the two adjacent edges sharing the same
state of a. In this case, G(S) and the corresponding input sequences S are shown
in Figure 34 (up to relabeling of the states).




Figure 34. Case I. Row witnesses for two adjacent edges share 
the same state of a. 



Case II. The row witnesses for the two nonadjacent edges share the same state
of a and the row witness for the third edge contains the final state in a. In this
case, G(S) and the corresponding input sequences S are shown in Figure 35 (up to
relabeling of the states).

Case III. The row witnesses for all three edges share the same state of a. In this
case, G(S) and the corresponding input sequences S are shown in Figure 36 (up to
relabeling of the states).

Figure 36. Case III. Row witnesses for all three edges share the
same state of a.

If C is a chordless cycle of length four, then without loss of generality it must
have the color pattern shown in Figure 37.




Figure 37. Color Pattern for chordless cycle of length four. 

Consider the row witnesses for edges (a_0, c_0) and (a_0, c_1). These row witnesses
cannot share the same state of b (otherwise, there would be a cycle on two colors
b and c, a contradiction). Similarly, row witnesses for edges (b_0, c_0) and (b_0, c_1)
cannot share the same state of a. Therefore, up to relabeling of the states, the row
witnesses are forced to have the pattern shown in Figure 39.




Figure 38. Row witnesses for edges (a_0, c_0) and (a_0, c_1) cannot
share the same state of b.




Figure 39. Forced pattern of row witnesses. 

Note that b_2 and c_0 cannot be adjacent in G(S); otherwise, there is a cycle on
two colors b and c (see Figure 40(a)). By symmetry, we can argue (see Figure 40) that

pairs (b_2, c_0), (a_2, c_0), (b_1, c_1), and (a_1, c_1) are nonadjacent in G(S). (*)

Figure 40. (b_2, c_0), (a_2, c_0), (b_1, c_1), (a_1, c_1) induce cycles on two
colors b and c.

Now, suppose b_2 and a_1 are adjacent in G(S). Then the row witness for (b_2, a_1)
cannot be c_0 and cannot be c_1 by (*). Therefore, the row witness for this edge
must be the third state c_2 of character c (see Figure 41). The partition intersection
graph G(S) and corresponding input sequences S are shown in Figure 41. Note
that G(S) is not properly triangulatable and condition (iii) is satisfied, since the
edge (a_0, b_0) is a forced edge to triangulate cycle C, creating a cycle on two colors
(a_0, b_0), (b_0, a_1), (a_1, b_2), (b_2, a_0) which cannot be properly triangulated.




Figure 41. Input sequences S and partition intersection graph
G(S) with a chordless cycle of length four.

If b_2 and a_2 are adjacent in G(S), then this induces a chordless cycle D of length
five (b_2, a_0), (a_0, c_0), (c_0, b_0), (b_0, a_2), (a_2, b_2) (the pairs (b_2, c_0) and (a_2, c_0) are
nonadjacent by (*) and (a_0, b_0) are nonadjacent since they are nonadjacent vertices
in chordless cycle C). This is a contradiction since C is chosen to be the largest
chordless cycle in G(S).




Figure 42. If b_2 and a_2 are adjacent in G(S), this creates a chordless
cycle of length five.

Suppose there are no further adjacencies between vertices in Figure 39. Then
there must be additional edges formed by the final state c_2 of character c in order
for G[a, b, c] to be nontriangulatable (condition (iii)). Now, state c_2 is adjacent






FUMEI LAM, DAN GUSFIELD, AND SRINATH SRIDHAR 



to one or more of the edges with color pattern (a, b). If c_2 is adjacent to exactly
one such edge, then the resulting graph G[a, b, c] can be properly triangulated by
adding the edge (a_0, b_0). Otherwise, state c_2 is adjacent to two or more edges. If
the two edges share a vertex (i.e., the two edges are either (a_1, b_0) and (a_2, b_0) or
(b_1, a_0) and (b_2, a_0)), then there is a cycle on two colors (as shown in Figures 43(a)
and 43(b)), contradicting condition (ii).





Figure 43. If c_2 witnesses two adjacent edges in G(S), this creates
a chordless cycle on two colors.



Otherwise, if state c_2 is adjacent to two nonadjacent edges in G(S) (Figures 44(a) and
44(b)), then this again creates a chordless cycle on two colors as shown in Figure
44(c), contradicting condition (ii).






Figure 44. If c_2 witnesses two nonadjacent edges in G(S), this
creates a chordless cycle on two colors.

In summary, Figure 45 shows the minimal obstruction sets to the existence of
perfect phylogenies for three-state characters up to relabeling of the character states.

6. Structure of Proper Triangulations for Partition Intersection 
Graphs on Three-State Characters 

The complete description of minimal obstruction sets for three-state characters
allows us to expand upon recent work of Gusfield which uses properties of legal
triangulations and minimal separators of partition intersection graphs to solve several
problems related to multi-state perfect phylogenies [?]. In particular, the following
is a necessary and sufficient condition for the existence of a perfect phylogeny for
multi-state data. We refer the reader to [?] for the necessary definitions and the
proof.




Figure 45. Minimal obstruction sets for three-state characters up
to relabeling.

Theorem 6.1 ([?] Theorem 3 MSPN). Input S allows a perfect phylogeny if and
only if there is a set Q of pairwise parallel legal minimal separators in partition
intersection graph G(S) such that every monochromatic pair of nodes in G(S) is
separated by some separator in Q.

For the special case of input S with characters over three states, the construction
of minimal obstruction sets in Section 5 allows us to simplify Theorem 6.1 to the
following.

Theorem 6.2. For input S on at most three states per character (r ≤ 3), there is
a three-state perfect phylogeny for S if and only if the partition intersection graph
for every pair of characters is acyclic and every monochromatic pair of nodes in
G(S) is separated by a legal minimal separator.



Proof. By Theorem 2.4, if three-state input S does not allow a perfect phylogeny
then there is a triple of characters a, b, c in S that does not allow a perfect
phylogeny. By Section 5, if every pair of characters (a, b), (a, c) and (b, c) is acyclic
(and therefore allows a perfect phylogeny) while a, b, and c together do not allow a
perfect phylogeny, then one of the graphs in Figure 45 appears as an induced graph
in partition intersection graph G(S) (up to relabeling). In each of these graphs,
consider the monochromatic pair of vertices c_0 and c_1; we will show that in each
graph, no legal separator can separate these two vertices.

In the first graph, consider the following three vertex-disjoint paths from c_0 to
c_1 (shown in Figure 46):

(1) c_0 → a_2 → c_1

(2) c_0 → b_1 → c_1

(3) c_0 → b_0 → a_0 → c_1

Figure 46. Disjoint paths from c_0 to c_1.

From these disjoint paths, we see that any separator Q for c_0 and c_1 must include
vertices a_2 and b_1 (to destroy paths (1) and (2)). Furthermore, Q must also contain
one of a_0 or b_0 (to destroy path (3)), implying that Q must contain at least two
states of the same color and therefore cannot be a legal separator.
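The counting argument over the three disjoint paths can be replayed as a small check. This is only a sketch under stated assumptions: vertices are written as strings such as "a2" whose first letter is the color, and a vertex set is treated as legal when it contains at most one state per color, which is the property the argument above relies on. The function names are hypothetical.

```python
from itertools import product

def is_legal(vertex_set):
    """Assumed legality test: no two vertices share a color (first letter)."""
    colors = [v[0] for v in vertex_set]
    return len(colors) == len(set(colors))

def one_vertex_per_path(path_interiors):
    """All ways to pick one interior vertex from each vertex-disjoint path.

    Any separator of the endpoints must contain at least one such choice.
    """
    return [set(choice) for choice in product(*path_interiors)]

# Interior vertices of paths (1), (2), and (3) from c_0 to c_1.
PATH_INTERIORS = [["a2"], ["b1"], ["b0", "a0"]]

candidates = one_vertex_per_path(PATH_INTERIORS)
no_legal_separator = not any(is_legal(c) for c in candidates)
```

Every candidate contains either two states of a or two states of b, so `no_legal_separator` evaluates to `True`; any larger separator contains one of the candidates and is illegal as well.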



A similar argument using the disjoint paths in Figure 47 shows that any separator
Q for vertices c_0 and c_1 cannot be a legal separator. Therefore, if every
monochromatic pair is separated by a legal minimal separator, none of these graphs in Figure
45 appears as a subgraph of the partition intersection graph, implying that every
subset of three characters allows a perfect phylogeny. Theorem 2.4 then shows the
entire set of characters allows a perfect phylogeny.

Figure 47. Disjoint paths showing all separators of c_0 and c_1 are
illegal.

The other direction of the theorem follows from Theorem 6.1. □

Theorem 6.2 implies that the requirement of Theorem 6.1 that the legal minimal 
separators in Q be pairwise parallel, can be removed for the case of input data over 
three-state characters. 



7. Construction of Fitch-Meacham Examples 

In this section, we examine in detail the class of Fitch-Meacham examples, which 
were first introduced by Fitch [THl [H] and later generalized by Meacham [27] . 
The goal of these examples is to demonstrate a lower bound on the number of 
characters that must be simultaneously examined in any test for perfect phylogeny. 
The natural conjecture generalizing our main result is that for any integer r > 3, 
there is a perfect phylogeny on r-state characters if and only if there is a perfect 
phylogeny for every subset of r characters. We show here that such a result would 
be the best possible, for any r. While the general construction of these examples 
and the resulting lower bounds were stated by Meacham [27], to the best of our 
knowledge, the proof of correctness for these lower bounds has not been established. 
We fill this gap by explicitly describing the complete construction for the entire class
of Fitch-Meacham examples and providing a proof for the lower bound claimed in
[27].

For each integer r (r ≥ 2), the Fitch-Meacham construction F_r is a set of r + 2
sequences over r characters, where each character takes r states. We describe the
construction of the partition intersection graph G(F_r); the set of sequences F_r can
be obtained from G(F_r) in a straightforward manner, with each taxon corresponding
to an r-clique in G(F_r).

Label the r characters in F_r by 0, 1, ..., r - 1; each vertex labeled by i will
correspond to a state in character i. The construction starts with two cliques EC_1
and EC_2 of size r, called end-cliques, with the vertices of each clique labeled by
0, 1, ..., r - 1. The vertex labeled i in EC_1 is adjacent to the vertex labeled (i + 1)
mod r in EC_2. For each such edge (i, (i + 1) mod r) between the two end-cliques,
we create a clique of size r - 2 with vertices labeled by {0, 1, ..., r - 1} \ {i, (i + 1)
mod r}. Every vertex in this (r - 2)-clique is then attached to both i (in end-clique
1) and (i + 1) mod r (in end-clique 2), creating an r-clique whose vertices are
labeled with integers 0, 1, ..., r - 1. There are a total of r such cliques, called
tower-cliques, and denoted by TC_1, TC_2, ..., TC_r. Note that for each i (0 ≤ i ≤ r - 1),
there are exactly r vertices labeled by i; we give each such vertex a distinct state,
resulting in r states for each character.
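The construction translates directly into sequences, one per r-clique. The sketch below is an illustration under an assumed state numbering that the paper does not fix: state 0 of every character is its vertex in EC_1, state 1 is its vertex in EC_2, and states 2, 3, ... are handed out to internal tower-clique vertices in order of creation. The function name is hypothetical.

```python
def fitch_meacham_sequences(r):
    """Return the r + 2 rows of F_r as tuples of states (assumed numbering)."""
    if r < 2:
        raise ValueError("the construction is defined for r >= 2")
    rows = [tuple([0] * r),      # taxon for end-clique EC_1
            tuple([1] * r)]      # taxon for end-clique EC_2
    fresh = [2] * r              # next unused state of each character
    for i in range(r):
        j = (i + 1) % r
        row = [None] * r
        row[i] = 0               # vertex labeled i in EC_1
        row[j] = 1               # vertex labeled (i + 1) mod r in EC_2
        for label in range(r):
            if label not in (i, j):
                # internal tower-clique vertex: a new state of this character
                row[label] = fresh[label]
                fresh[label] += 1
        rows.append(tuple(row))  # taxon for the tower-clique on edge (i, j)
    return rows
```

With this numbering, `fitch_meacham_sequences(2)` returns the four binary sequences 00, 11, 01, 10, and `fitch_meacham_sequences(3)` returns five sequences in which each of the three characters takes three states.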

Note that the graph corresponding to the four gamete obstruction set is an
instance of the Fitch-Meacham construction with r = 2. In this case, the four
binary sequences 00, 01, 10, 11 have two states, two colors and four taxa and the
partition intersection graph for these sequences is precisely the graph G(F_2). Note
that in this case, every subset of r - 1 = 1 characters has a perfect phylogeny, while
the entire set of characters does not. Similarly, the fourth graph shown in Figure 45
illustrating the obstruction set for 3-state input is the graph G(F_3) corresponding
to the Fitch-Meacham construction for r = 3 (in the figure, EC_1 = {a_0, b_2, c_1} and
EC_2 = {a_1, b_0, c_0}). As shown in Section 5, every r - 1 = 2 set of characters
in the corresponding input set allows a perfect phylogeny while the entire set of
characters does not. The following theorem generalizes this property to the entire
class of Fitch-Meacham examples. Because the theorem was stated without proof
in [27], we provide a proof of the result here.

Theorem 7.1. [27] For every r ≥ 2, F_r is a set of input sequences over r-state
characters such that every (r - 1)-subset of characters allows a perfect phylogeny while
the entire set F_r does not allow a perfect phylogeny.

Proof. We first show that G(F_r) does not allow a proper triangulation for any r.
As observed above, G(F_2) is a four-cycle corresponding to two colors and therefore
does not allow a proper triangulation (since any proper triangulation for a graph
containing cycles must have at least three colors). Suppose G(F_r) is properly
triangulatable for some r ≥ 3, let s be the smallest integer such that G(F_s) has a
proper triangulation, and let G'(F_s) be a minimal proper triangulation of G(F_s).

For each tower-clique TC_i in G(F_s), consider the set of vertices in TC_i that are
not contained in either end-clique; call these vertices internal tower-clique vertices
and the remaining two tower vertices end tower-clique vertices. Note that the
removal of the two end tower-clique vertices disconnects the internal tower-clique
vertices from the rest of the graph. This implies that the internal tower-clique
vertices cannot be part of any chordless cycle: otherwise, such a chordless cycle C
must contain both end tower-clique vertices i and (i + 1) mod s. However, the two
end tower-clique vertices are connected by an edge and therefore induce a chord in
C, a contradiction since C is a chordless cycle.

In the graph G(F_s), consider the following cycle of length four: s - 2 (in EC_1)
→ s - 1 (in EC_1) → 0 (in EC_2) → s - 1 (in EC_2) → s - 2 (in EC_1). This four-cycle
has a unique proper triangulation, which forces the edge e between vertex s - 2
in EC_1 and vertex 0 in EC_2 to be included in G'(F_s). Consider adding edge e,
removing all vertices labeled s - 1 from G'(F_s), and for the two vertices labeled
s - 1 in end-cliques EC_1 and EC_2, remove all interior tower-clique vertices (but not
end tower-clique vertices) adjacent to s - 1. Then edge e between vertices s - 2 and
0 is still present and we can expand e into a tower-clique of size s - 1 (by forming
a clique with new vertices 1, 2, ..., s - 3 adjacent to both s - 2 and 0 of the two
end-cliques).

In the resulting graph, the vertices are exactly those of G(F_{s-1}) and all edges in
G(F_{s-1}) are present. Furthermore, if there is a chordless cycle in this graph, then
it would create a chordless cycle in G'(F_s) since no internal tower-clique vertex
can be part of any chordless cycle (and in particular, the new vertices 1, 2, ..., s - 3
cannot be part of any chordless cycle). Therefore, the resulting graph is a proper
triangulation for G(F_{s-1}), a contradiction since s was chosen to be the smallest
integer such that G(F_s) allows a proper triangulation.

To prove the second part of the theorem, we show that in F_r, any subset of r - 1
characters does allow a perfect phylogeny by proving that the partition intersection
graph on any subset of r - 1 characters has a proper triangulation. By the symmetry
of the construction of F_r, we can assume without loss of generality that the r - 1
characters under consideration are {0, 1, ..., r - 2}. Consider the graph obtained
by connecting every vertex i (0 ≤ i ≤ r - 3) in EC_1 to every vertex j satisfying
j > i in EC_2. Note the asymmetry between the first and second end-cliques in this
construction and observe that none of the added edges are between characters with
the same label.

Suppose the resulting graph contains a chordless cycle $C$. Then $C$ cannot contain
three or more vertices in either end-clique and cannot contain any internal
tower-clique vertices (as noted earlier), so it must have length exactly four with two
vertices in each end-clique. It cannot be the case that two nonadjacent vertices of $C$
are in the same end-clique, since these vertices would be adjacent and $C$ would not
be chordless. Therefore, cycle $C$ must be formed as follows: $i$ (in $EC_1$) $\to$ $j$ (in
$EC_2$) $\to$ $j'$ (in $EC_2$) $\to$ $i'$ (in $EC_1$). Since $i$ and $j$ are adjacent, we have
$i < j$, and since $i'$ and $j'$ are adjacent, we have $i' < j'$. If $i < j'$, then $i$ and $j'$
are adjacent and the cycle $C$ is not chordless, a contradiction. Therefore,
$i' < j' \le i < j$, which implies $i'$ and $j$ are adjacent and the cycle $C$ is not
chordless, again a contradiction. It follows that there are no chordless cycles and
the added edges form a proper triangulation for the partition intersection graph on
the subset of $r-1$ characters $\{0, 1, \ldots, r-2\}$. □
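
The end-clique part of this argument can be checked mechanically for small cases. The following is only a sketch under a simplifying assumption: it builds the two end-cliques on labels $0, \ldots, m-1$ together with the edges $i$ (in $EC_1$) to $j$ (in $EC_2$) for $i < j$, and omits the internal tower-clique vertices, which the proof shows cannot lie on a chordless cycle. The function names `end_clique_graph` and `is_chordal` are illustrative and do not appear in the paper.

```python
from itertools import combinations

def end_clique_graph(m):
    """Two end-cliques on labels 0..m-1 plus edges i (EC1) -- j (EC2) for i < j.

    Simplification: internal tower-clique vertices are left out.
    """
    verts = [(side, label) for side in (1, 2) for label in range(m)]
    adj = {v: set() for v in verts}
    for u, v in combinations(verts, 2):
        same_clique = u[0] == v[0]
        # verts lists EC1 before EC2, so a mixed pair always has u in EC1.
        added_edge = u[0] == 1 and v[0] == 2 and u[1] < v[1]
        if same_clique or added_edge:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def is_chordal(adj):
    """Chordality test by repeatedly deleting a simplicial vertex."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    while adj:
        simplicial = next(
            (v for v, ns in adj.items()
             if all(b in adj[a] for a, b in combinations(ns, 2))),
            None,
        )
        if simplicial is None:
            return False  # no simplicial vertex left: a chordless cycle exists
        for n in adj.pop(simplicial):
            adj[n].discard(simplicial)
    return True
```

For example, `is_chordal(end_clique_graph(5))` corresponds to the case $r = 6$ with the tower-clique interiors removed, while a plain four-cycle is rejected by the same test.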

8. Conflict Hypergraphs for Three State Characters 

For binary data, the Four Gamete Condition/Splits Equivalence Theorem implies
that the existence of a perfect phylogeny can be determined by a pairwise test. This
motivates the following definition.

Definition 8.1. For binary input $S$ on characters $C$, two characters $\chi_i, \chi_j \in C$ are
in conflict or incompatible if $\chi_i$ and $\chi_j$ contain all four gametes. The conflict (or
incompatibility) graph $C(S)$ for $S$ is defined on vertices $V$ and edges $E$ with

$$V = \{\chi_i : \chi_i \in C\}$$

$$E = \{(\chi_i, \chi_j) : \chi_i \text{ and } \chi_j \text{ are incompatible}\}$$
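
As an illustration of Definition 8.1, the pairwise test can be sketched in a few lines. The sketch assumes the binary input is given as a list of equal-length rows (one row per taxon, one column per character); the names `in_conflict` and `conflict_graph` are hypothetical and not taken from the paper.

```python
from itertools import combinations

def in_conflict(col_i, col_j):
    """True if two binary characters exhibit all four gametes 00, 01, 10, 11."""
    return len(set(zip(col_i, col_j))) == 4

def conflict_graph(rows):
    """Return (vertices, edges) of the conflict graph for binary input rows."""
    columns = list(zip(*rows))  # one tuple of states per character
    vertices = list(range(len(columns)))
    edges = {
        (i, j)
        for i, j in combinations(vertices, 2)
        if in_conflict(columns[i], columns[j])
    }
    return vertices, edges
```

With rows `["001", "011", "101", "111"]`, the first two characters show all four gametes and the third is constant, so the only edge is `(0, 1)`.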

By the Four Gamete Condition, binary input $S$ allows a perfect phylogeny if
and only if the corresponding incompatibility graph $C(S)$ has no edges. For input
containing incompatible characters, there has been extensive literature devoted to
studying the structure of incompatibility graphs and using these graphs to solve
related problems. For example, Gusfield et al. [?, 19] and Huson et al. [24] use
incompatibility graphs to achieve decomposition theorems for phylogenies, and Gusfield,
Hickerson, and Eddhu [21], Bafna and Bansal [2], and Hudson and Kaplan [23]
use them to achieve lower bounds on the number of recombination events needed to
explain a set of sequences.

The incompatibility graph is also the basis for algorithms to solve the character 
removal or maximum compatibility problem, which asks for the minimum number 
of characters that must be removed from an input set such that the remaining char- 
acters allow a perfect phylogeny. For binary data, the character removal problem is 
equivalent to the vertex cover problem on the corresponding incompatibility graph
[11, 30].
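
To make the equivalence concrete, here is a minimal sketch that solves character removal on a binary incompatibility graph by exhaustive search for a minimum vertex cover. It is exponential in the number of characters and is meant only for small illustrative inputs; the function name `min_character_removal` and the edge-set representation are assumptions made for this example.

```python
from itertools import combinations

def min_character_removal(num_chars, edges):
    """Smallest set of characters whose removal leaves no incompatible pair.

    `edges` is an iterable of pairs (i, j) of incompatible characters,
    i.e. the edge set of the incompatibility graph.
    """
    edges = list(edges)
    for size in range(num_chars + 1):
        for removed in combinations(range(num_chars), size):
            chosen = set(removed)
            # A vertex cover touches every edge of the graph.
            if all(i in chosen or j in chosen for i, j in edges):
                return chosen
    return set(range(num_chars))  # not reached: the full set is always a cover
```

For a path of incompatibilities `{(0, 1), (1, 2), (2, 3)}` on four characters, two characters must be removed.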

Until this work, the notion of incompatibility graph was defined for binary
characters only, as motivated by the Splits Equivalence Theorem. Our generalization
of the Splits Equivalence Theorem therefore allows us to generalize in a natural
way the notion of incompatibility for three state characters. The resulting
incompatibility structure will be a hypergraph whose edges correspond to pairs and
triples of characters that do not allow a perfect phylogeny. In particular, for input
data $S$ on three state characters $C$, let $E_2(S)$ be the set of character pairs $(\chi_i, \chi_j)$
such that $\chi_i$ and $\chi_j$ do not allow a perfect phylogeny and let $E_3(S)$ be the set of
triples $(\chi_i, \chi_j, \chi_k)$ such that $\chi_i, \chi_j, \chi_k$ together do not allow a perfect phylogeny
but each of the pairs $(\chi_i, \chi_j), (\chi_i, \chi_k), (\chi_j, \chi_k)$ allows a perfect phylogeny (i.e.,
$(\chi_i, \chi_j), (\chi_i, \chi_k), (\chi_j, \chi_k) \notin E_2(S)$). Then the incompatibility hypergraph $C(S)$ is
defined on vertices $V$ and hyperedges $E$ with

$$V = \{\chi_i : \chi_i \in C\}$$

$$E = E_2(S) \cup E_3(S)$$
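
The construction of $E_2(S)$ and $E_3(S)$ can be sketched as follows. The sketch does not implement the perfect phylogeny test itself: it assumes a caller-supplied predicate, here called `allows_perfect_phylogeny` (a hypothetical name), that takes a sorted tuple of two or three character indices and reports whether those characters allow a perfect phylogeny.

```python
from itertools import combinations

def incompatibility_hypergraph(num_chars, allows_perfect_phylogeny):
    """Return (E2, E3) for characters indexed 0..num_chars-1.

    `allows_perfect_phylogeny(subset)` is an assumed oracle on tuples of
    character indices; it is not defined in this sketch.
    """
    e2 = {
        pair
        for pair in combinations(range(num_chars), 2)
        if not allows_perfect_phylogeny(pair)
    }
    e3 = set()
    for triple in combinations(range(num_chars), 3):
        # Skip triples that already contain an incompatible pair, so that
        # no hyperedge is contained in another.
        if any(pair in e2 for pair in combinations(triple, 2)):
            continue
        if not allows_perfect_phylogeny(triple):
            e3.add(triple)
    return e2, e3
```

Because triples containing a pair from $E_2(S)$ are skipped, the oracle is called on a triple only when all three of its pairs are compatible.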

Note that by definition, no hyperedge of $C(S)$ is contained in another
hyperedge. The extension of incompatibility to three state characters can be used to
solve algorithmic and theoretical problems for three state characters analogous to 
those for binary characters. In particular, the character removal problem on three 
state characters can be solved using the following generalization of the vertex cover 
problem. 

3-Hitting Set Problem 

Input: A collection $M$ of subsets of size at most three from a finite ground
set and a positive integer $k$.

Problem: Determine if there is a set $L$ of ground-set elements with $|L| \le k$
such that $L$ contains at least one element from each subset in $M$.






The 3-hitting set problem is NP-complete [?] and the best known approximation
algorithms for the problem have approximation ratio equal to three [?]. More
recently, it has been shown that the 3-hitting set problem is fixed parameter tractable
in the parameter $k$. In a series of papers, algorithms for solving the 3-hitting set
problem have been given with running times $O(2.270^k + n)$ [?], $O(2.179^k + n)$ [12],
and $O(2.076^k + n)$ [?]. It has also been shown that the 3-hitting set problem allows a
linear-time kernelization, a preprocessing step typical for parameterized algorithms
that converts a problem of input size $n$ to an instance whose size depends only on $k$.
In [?], it is shown that any 3-hitting set instance can be reduced to an equivalent
instance of size at most $5k^2 + k$ elements.
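
To illustrate why the problem is fixed parameter tractable, the following sketch uses the elementary bounded search tree: pick any subset that is not yet hit and branch on which of its (at most three) elements to add, giving at most $3^k$ branches. This is not one of the faster algorithms cited above, and the function name `hitting_set` is chosen for this example only.

```python
def hitting_set(subsets, k, chosen=frozenset()):
    """Return a set of at most k more elements hitting every subset, or None.

    `subsets` is a list of tuples of size at most three; `chosen` holds the
    elements already selected on the current branch.
    """
    # Find some subset that contains none of the chosen elements.
    unhit = next((s for s in subsets if not (set(s) & chosen)), None)
    if unhit is None:
        return set(chosen)  # every subset is hit
    if k == 0:
        return None  # budget exhausted with a subset still unhit
    for element in unhit:  # at most three branches per level
        result = hitting_set(subsets, k - 1, chosen | {element})
        if result is not None:
            return result
    return None
```

Applied to the hyperedges of an incompatibility hypergraph, a returned set is a set of at most $k$ characters that meets every incompatible pair and triple.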

Using these results for the 3-hitting set problem, we obtain the following. 

Corollary 8.2. The Character Removal Problem for three-state input is fixed pa- 
rameter tractable. 

9. Conclusion 

We have studied the structure of the three state perfect phylogeny problem and 
shown that there is a necessary and sufficient condition for the existence of a perfect 
phylogeny for three state characters using triples of characters. This extends the 
extremely useful Splits Equivalence Theorem and Four Gamete Condition. The 
obvious extension of our work would be to discover similar results for $r$-state
characters for $r \ge 4$.

Until this work, the notion of a conflict, or incompatibility, graph has been de- 
fined for two state characters only. Our generalization of the four gamete condition 
allows us to generalize this notion to incompatibility on three state characters. The 
resulting incompatibility structure is a hypergraph, which can be used to solve al- 
gorithmic and theoretical problems for three state characters analogous to those for 
binary characters. 

In addition, there are several theoretical and practical results known for two 
state characters that are still open for characters on three or more states. For 
instance, it is known that the problem of constructing near-perfect phylogenies 
for two state characters is fixed parameter tractable; the analogous problem is 
open for characters on three or more states. Similarly, the question of whether 
incompatibility hypergraphs can be used to find decomposition theorems for lower 
bounding recombination events remains open for three or more states. With the 
recent increase in collection of polymorphism data such as micro/mini-satellites, 
there is a need for the analysis of perfect phylogenies to be extended to multiple 
state characters. Our work lays a solid theoretical foundation we hope will help 
with this effort. 